Abstract

In this work, we use algebraic methods for studying distance computation and 
subgraph detection tasks in the congested clique model. Specifically, we adapt parallel 
matrix multiplication implementations to the congested clique, obtaining an
$O(n^{1-2/\omega})$ round matrix multiplication algorithm, where $\omega < 2.3728639$ is the exponent of matrix
multiplication. In conjunction with known techniques from centralised algorithmics, 
this gives significant improvements over previous best upper bounds in the congested 
clique model. The highlight results include: 

- triangle and 4-cycle counting in $O(n^{0.158})$ rounds, improving upon the
  $O(n^{1/3})$ algorithm of Dolev et al. [DISC 2012],

- a $(1 + o(1))$-approximation of all-pairs shortest paths in $O(n^{0.158})$ rounds,
  improving upon the $\tilde{O}(n^{1/2})$-round $(2 + o(1))$-approximation algorithm of
  Nanongkai [STOC 2014], and

- computing the girth in $O(n^{0.158})$ rounds, which is the first non-trivial solution in
  this model.

In addition, we present a novel constant-round combinatorial algorithm for detecting 
4-cycles. 
1 Introduction


Algebraic methods have become a recurrent tool in centralised algorithmics, employing a 
wide range of techniques (e.g., [10-16, 20-22, 26, 28, 29, 43, 44, 50, 58, 71, 72]). In this 
paper, we bring techniques from the algebraic toolbox to the aid of distributed computing, 
by leveraging fast matrix multiplication in the congested clique model. 

In the congested clique model, the n nodes of a graph G communicate by exchanging 
messages of $O(\log n)$ size in a fully-connected synchronous network; initially, each node is
aware of its neighbours in G. In comparison with the traditional CONGEST model [60], 
the key difference is that a pair of nodes can communicate directly even if they are not 
adjacent in graph G. The congested clique model masks away the effect of distances on the 
computation and focuses on the limited bandwidth. As such, it has been recently gaining 
increasing attention [24, 25, 36, 37, 46, 49, 51, 57, 59, 63], in an attempt to understand 
the relative computational power of distributed computing models. 

The key insight of this paper is that matrix multiplication algorithms from parallel
computing can be adapted to obtain an $O(n^{1-2/\omega})$ round matrix multiplication algorithm
in the congested clique, where $\omega < 2.3728639$ is the matrix multiplication exponent [33].
Combining this with well-known centralised techniques allows us to use fast matrix
multiplication to solve various combinatorial problems, immediately giving $O(n^{0.158})$-time
algorithms in the congested clique for many classical graph problems. Indeed, while most
of the techniques we use in this work are known beforehand, their combination gives
significant improvements over the best previously known upper bounds. Table 1 contains a
summary of our results, which we overview in more detail in what follows.


Problem                              Running time
                                     This work                  Prior work

matrix multiplication (semiring)     $O(n^{1/3})$               -
matrix multiplication (ring)         $O(n^{0.158})$             $O(n^{0.373})$ [25]
triangle counting                    $O(n^{0.158})$             $O(n^{1/3}/\log n)$ [24]
4-cycle detection                    $O(1)$                     $O(n^{1/2}/\log n)$ [24]
4-cycle counting                     $O(n^{0.158})$             $O(n^{1/2}/\log n)$ [24]
$k$-cycle detection                  $2^{O(k)} n^{0.158}$       $O(n^{1-2/k}/\log n)$ [24]
girth                                $O(n^{0.158})$             -
weighted, directed APSP              $O(n^{1/3} \log n)$        -
  • weighted diameter $U$            $O(U n^{0.158})$           -
  • $(1+o(1))$-approximation         $O(n^{0.158})$             -
  • $(2+o(1))$-approximation                                    $\tilde{O}(n^{1/2})$ [57]
unweighted, undirected APSP          $O(n^{0.158})$             -
  • $(2+o(1))$-approximation                                    $\tilde{O}(n^{1/2})$ [57]

Table 1: Our results versus prior work, for the currently best known bound $\omega < 2.3729$ [33];
$\tilde{O}$ notation hides polylogarithmic factors.











1.1 Matrix Multiplication on a Congested Clique 

As a basic primitive, we consider the computation of the product $P = ST$ of two $n \times n$
matrices $S$ and $T$ on a congested clique of $n$ nodes. We will tacitly assume that the
matrices are initially distributed so that node $v$ has row $v$ of both $S$ and $T$, and each node
$v$ will receive row $v$ of $P$ in the end. Recall that the matrix multiplication exponent $\omega$ is
defined as the infimum over $\sigma$ such that the product of two $n \times n$ matrices can be computed
with $O(n^{\sigma})$ arithmetic operations; it is known that $2 \le \omega < 2.3728639$ [33], and it is
conjectured, though not unanimously, that $\omega = 2$.

Theorem 1. The product of two $n \times n$ matrices can be computed in a congested clique of
$n$ nodes in $O(n^{1/3})$ rounds over semirings. Over rings, this product can be computed in
$O(n^{1-2/\omega+\varepsilon})$ rounds for any constant $\varepsilon > 0$.

Theorem 1 follows by adapting known parallel matrix multiplication algorithms for 
semirings [1, 54] and rings [7, 52, 55, 70] to the clique model, via the routing technique 
of Lenzen [46]. In fact, with little extra work one can show that the resulting algorithm is 
also oblivious, that is, the communication pattern is predefined and does not depend on 
the input matrices. Hence, the oblivious routing technique of Dolev et al. [24] suffices for
implementing these matrix multiplication algorithms. 

The above addresses matrices whose entries can be encoded with $O(\log n)$ bits, which is
sufficient for dealing with integers of absolute value at most $n^{O(1)}$. In general, if $b$ bits are
sufficient to encode matrix entries, the bounds above hold with a multiplicative factor of
$b/\log n$; for example, working with integers with absolute value at most $2^n$ merely incurs
a factor $n/\log n$ overhead in running times.

Distributed matrix multiplication exponent. Analogously with the matrix multiplication
exponent, we denote by $\rho$ the exponent of matrix multiplication in the congested
clique model, that is, the infimum over all values $\sigma$ such that there exists a matrix
multiplication algorithm in the congested clique running in $O(n^{\sigma})$ rounds. In this notation,
Theorem 1 gives us

$$\rho \le 1 - 2/\omega < 0.15715;$$

prior to this work, it was known that $\rho \le \omega - 2$ [25].
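As a quick sanity check of the arithmetic behind these two bounds, the following minimal sketch evaluates both expressions for the cited bound on the matrix multiplication exponent. The function and constant names are illustrative only and do not come from the paper.

```python
# Numerical check of the two exponent bounds quoted above.
# OMEGA_BOUND is the upper bound on the matrix multiplication exponent cited from [33].
OMEGA_BOUND = 2.3728639


def clique_exponent_bound(omega: float) -> float:
    """Upper bound 1 - 2/omega on the congested clique exponent (Theorem 1)."""
    return 1.0 - 2.0 / omega


def prior_exponent_bound(omega: float) -> float:
    """Earlier upper bound omega - 2 on the congested clique exponent [25]."""
    return omega - 2.0


new_bound = clique_exponent_bound(OMEGA_BOUND)
old_bound = prior_exponent_bound(OMEGA_BOUND)
print(f"1 - 2/omega = {new_bound:.5f}, omega - 2 = {old_bound:.5f}")
```

Both expressions vanish at $\omega = 2$, and for any $\omega > 2$ the first is strictly smaller, since $1 - 2/\omega = (\omega - 2)/\omega$.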

For the rest of this paper, we will, analogously with the convention in centralised
algorithmics, slightly abuse this notation by writing $n^{\rho}$ for the complexity of matrix
multiplication in the congested clique. This hides factors up to $O(n^{\varepsilon})$ resulting from the
fact that the exponent $\rho$ is defined as the infimum of an infinite set.

Lower bounds for matrix multiplication. The matrix multiplication results are 
optimal in the sense that for any sequential matrix multiplication implementation, any 
scheme for simulating that implementation in the congested clique cannot give a faster 
algorithm than the construction underlying Theorem 1; this follows from known results 
for parallel matrix multiplication [2, 8, 41, 69]. Moreover, we note that for the broadcast 
congested clique model, where each node is required to send the same message to all nodes 
in any given round, recent lower bounds [38] imply that matrix multiplication cannot be
done faster than $\tilde{\Omega}(n)$ rounds.
1.2 Applications in Subgraph Detection

Cycle detection and counting. Our first application of fast matrix multiplication is
to the problems of triangle counting [42] and 4-cycle counting.

Corollary 2. For directed and undirected graphs, the number of triangles and 4-cycles can
be computed in $O(n^{\rho})$ rounds.

For $\rho \le 1 - 2/\omega$, this is an improvement upon the previously best known $O(n^{1/3})$-round
triangle detection algorithm of Dolev et al. [24] and an $O(n^{\omega-2+\varepsilon})$-round algorithm
of Drucker et al. [25]. Indeed, we disprove the conjecture of Dolev et al. [24] that any
deterministic oblivious algorithm for detecting triangles requires $\tilde{\Omega}(n^{1/3})$ rounds.

When only detection of cycles is required, we observe that combining the fast distributed
matrix multiplication with the well-known technique of colour-coding [5] allows us to detect
$k$-cycles in $\tilde{O}(n^{\rho})$ rounds for any constant $k$. This improves upon the subgraph detection
algorithm of Dolev et al. [24], which requires $O(n^{1-2/k}/\log n)$ rounds for detecting subgraphs
of $k$ nodes. However, we do not improve upon the algorithm of Dolev et al. for general
subgraph detection.

Theorem 3. For directed and undirected graphs, the existence of $k$-cycles can be detected
in $2^{O(k)} n^{\rho} \log n$ rounds.

For the specific case of $k = 4$, we provide a novel algorithm that does not use matrix
multiplication and detects 4-cycles in only $O(1)$ rounds.

Theorem 4. The existence of 4-cycles can be detected in $O(1)$ rounds.

Girth. We compute the girth of a graph by leveraging a known trade-off between the 
girth and the number of edges of the graph [53]. Roughly, we detect short cycles fast, and 
if they do not exist then the graph must have sufficiently few edges to be learned by all 
nodes. As far as we are aware, this is the first algorithm to compute the girth in this 
setting. 

Theorem 5. For undirected, unweighted graphs, the girth can be computed in $O(n^{\rho})$
rounds.

1.3 Applications in Distance Computation

Shortest paths. The all-pairs shortest paths problem (APSP) likewise admits algorithms
based on matrix multiplication. The basic idea is to compute the $n$th power of the input
graph's weight matrix over the min-plus semiring, by iteratively computing squares of the
matrix [27, 32, 56].
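The repeated-squaring idea can be sketched centrally as follows. This is a minimal illustration rather than the distributed algorithm: it assumes the weight matrix uses `math.inf` for missing edges and that the graph has no negative cycles, and the function names are hypothetical.

```python
import math

INF = math.inf


def min_plus_product(S, T):
    """Matrix product over the min-plus semiring: (S*T)[i][j] = min_k S[i][k] + T[k][j]."""
    n = len(S)
    return [[min(S[i][k] + T[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def apsp_by_squaring(W):
    """All-pairs distances from weight matrix W by about log2(n) min-plus squarings."""
    n = len(W)
    # Zero diagonal: a path may "stay put", so the t-th power covers paths of at most t edges.
    D = [[0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    squarings = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(squarings):
        D = min_plus_product(D, D)
    return D


# Example: directed edges 0->1 (3), 1->2 (-1), 2->3 (2), 0->3 (10).
W = [[INF, 3, INF, 10],
     [INF, INF, -1, INF],
     [INF, INF, INF, 2],
     [INF, INF, INF, INF]]
D = apsp_by_squaring(W)
```

Each squaring doubles the number of edges a path may use, so $\lceil \log_2 n \rceil$ squarings cover all simple paths. In the congested clique, every squaring would be one call to the semiring matrix multiplication routine.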

Corollary 6. For weighted, directed graphs with integer weights in $\{0, \pm 1, \dots, \pm M\}$,
all-pairs shortest paths can be computed in $O(n^{1/3} \log n \lceil \log M / \log n \rceil)$ communication rounds.

We can leverage fast ring matrix multiplication to improve upon the above result; however,
the use of ring matrix multiplication necessitates some trade-offs or extra assumptions.
For example, for unweighted and undirected graphs, it is possible to recover the exact
shortest paths from powers of the adjacency matrix over the Boolean semiring [65].

Corollary 7. For undirected, unweighted graphs, all-pairs shortest paths can be computed
in $O(n^{\rho})$ rounds.

For small integer weights, we use the well-known idea of embedding a min-plus semiring 
matrix product into a matrix product over a ring; this gives a multiplicative factor to the 
running time proportional to the length of the longest path. 
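One common way to realise such an embedding is sketched below. It is an assumption of this example, not necessarily the exact encoding used later in the paper: a weight $w \le M$ is encoded as the integer $X^{M-w}$ with base $X = n + 1$, and an ordinary integer matrix product is decoded by reading off the highest power of $X$ present. All names are hypothetical.

```python
import math

INF = math.inf


def min_plus_via_ring(S, T, M):
    """Min-plus product of n x n matrices with entries in {0, ..., M} or INF,
    computed through a single ordinary matrix product over the integers."""
    n = len(S)
    X = n + 1  # base larger than the number of summed terms, so no carries occur

    def encode(w):
        return 0 if w == INF else X ** (M - w)

    def decode(value):
        if value == 0:
            return INF
        exponent = 0
        while value >= X:  # floor(log_X(value)) by repeated division
            value //= X
            exponent += 1
        return 2 * M - exponent

    S_enc = [[encode(w) for w in row] for row in S]
    T_enc = [[encode(w) for w in row] for row in T]
    # Ordinary ring product; the largest power of X in each entry marks the minimum sum.
    P = [[sum(S_enc[i][k] * T_enc[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return [[decode(value) for value in row] for row in P]


A = [[0, 2, INF],
     [INF, 0, 1],
     [4, INF, 0]]
result = min_plus_via_ring(A, A, M=4)
```

The encoded entries need roughly $M \log n$ bits, which is where an overhead proportional to the largest path weight comes from.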

Corollary 8. For directed graphs with positive integer weights and weighted diameter $U$,
all-pairs shortest paths can be computed in $O(U n^{\rho})$ rounds.

While this corollary is only relevant for graphs of small weighted diameter, the same 
idea can be combined with weight rounding [57, 64, 76] to obtain a fast approximate APSP 
algorithm without such limitations. 

Theorem 9. For directed graphs with integer weights in $\{0, 1, \dots, 2^{n^{o(1)}}\}$, we can compute
$(1 + o(1))$-approximate all-pairs shortest paths in $O(n^{\rho + o(1)})$ rounds.

For comparison, the previously best known combinatorial algorithm for APSP on the
congested clique achieves a $(2 + o(1))$-approximation in $\tilde{O}(n^{1/2})$ rounds [57].

1.4 Additional Related Work 

Computing distances in graphs, such as the diameter, all-pairs shortest paths (APSP), and
single-source shortest paths (SSSP), is a fundamental problem in most computing settings.
The reason for this lies in the abundance of applications of such computations, evident
also from the large amount of research dedicated to them [18, 19, 30, 34, 35, 67, 68, 73, 75-77].

In particular, computing graph distances is vital for many distributed applications and, 
as such, has been widely studied in the CONGEST model of computation [60], where n 
processors located in n distinct nodes of a graph G communicate over the graph edges 
using $O(\log n)$-bit messages. Specifically, many algorithms and lower bounds were given
for computing and approximating graph distances in this setting [23, 31, 39, 40, 45, 47, 48, 
57, 61, 62]. Some lower bounds apply even for graphs of small diameter; however, these 
lower bound constructions boil down to graphs that contain bottleneck edges limiting the 
amount of information that can be exchanged between different parts of the graph quickly. 

The intuition that the congested clique model would abstract away distances and 
bottlenecks and bring to light only the congestion challenge has proven inaccurate. Indeed, 
a number of tasks have been shown to admit sub-logarithmic or even constant-round 
solutions, exceeding by far what is possible in the CONGEST model with only low 
diameter. The pioneering work of Lotker et al. [51] shows that a minimum spanning tree 
(MST) can be computed in $O(\log \log n)$ rounds. Hegeman et al. [37] show how to construct
a 3-ruling set, with applications to maximal independent set and an approximation of the 
MST in certain families of graphs; sorting and routing have been recently addressed by 
various authors [46, 49, 59]. A connection between the congested clique model and the 
MapReduce model is discussed by Hegeman and Pemmaraju [36], where algorithms are 
given for colouring problems. On top of these positive results, Drucker et al. [25] recently 
proved that essentially any non-trivial unconditional lower bound on the congested clique 
would imply novel circuit complexity lower bounds. 

The same work also points out the connection between fast matrix multiplication
algorithms and triangle detection in the congested clique. Their construction yields an
$O(n^{\omega-2+\varepsilon})$ round algorithm for matrix multiplication over rings in the congested clique
model, giving also the same running time bound for triangle detection; if $\omega = 2$, this gives
$\rho = 0$, matching our result. However, with the currently best known centralised matrix
multiplication algorithm, the running time of the resulting triangle detection algorithm
is $O(n^{0.373})$ rounds, still slower than the combinatorial triangle detection of Dolev et al.
[24], and if $\omega > 2$, the solution presented in this paper is faster.

2 Matrix Multiplication Algorithms 

In this section, we consider computing the product $P = ST$ of two $n \times n$ matrices $S = (S_{ij})$
and $T = (T_{ij})$ on the congested clique with $n$ nodes. For convenience, we tacitly assume
that nodes $v \in V$ are identified with $\{1, 2, \dots, n\}$, and use nodes $v \in V$ to directly index
the matrices. The local input in the matrix multiplication task for each node $v \in V$ is the
row $v$ of both $S$ and $T$, and at the end of the computation each node $v \in V$ will output
the row $v$ of $P$. However, we note that the exact distribution of the input and output is
not important, as we can re-arrange the entries in constant rounds as long as each node
has $O(n)$ entries [46].

Theorem 1. The product of two $n \times n$ matrices can be computed in a congested clique of
$n$ nodes in $O(n^{1/3})$ rounds over semirings. Over rings, this product can be computed in
$O(n^{1-2/\omega+\varepsilon})$ rounds for any constant $\varepsilon > 0$.

Theorem 1 follows directly by simulating known parallel matrix multiplication algorithms
in the congested clique model using a result of [46]. This work discusses simulation
of the bulk-synchronous parallel (BSP) model, which we can use to obtain Theorem 1 as a 
corollary from known BSP matrix multiplication results [54, 55, 70]. However, essentially 
the same matrix multiplication algorithms have been widely studied in various parallel 
computation models, and the routing scheme underlying the simulation result of [46] allows 
also simulation of these other models on the congested clique: 

- The first part of Theorem 1 is based on the so-called parallel 3D matrix multiplication 
algorithm [1, 54], essentially a parallel implementation of the school-book matrix 
multiplication; alternatively, the same algorithm can be obtained by slightly modifying 
the triangle counting algorithm of Dolev et al. [24]. 

- The second part uses a scheme that allows one to adapt any bilinear matrix multiplication
algorithm into a fast parallel matrix multiplication algorithm [7, 52, 55, 70].

A more detailed examination in fact shows that the matrix multiplication algorithms 
are oblivious, that is, the communication pattern is pre-defined and only the content of the 
messages depends on the input. This further allows us to use the static routing scheme 
of Dolev et al. [24], resulting in simpler algorithms with smaller constant factors in the 
running time. 

To account for all the details, and to provide an easy access for readers not familiar 
with the parallel computing literature, we present the congested clique versions of these 
algorithms in full detail in Sections 2.1 and 2.2. 
Figure 1: Semiring matrix multiplication: partitioning scheme for matrix entries.

2.1 Semiring matrix multiplication 

Preliminaries. For convenience, let us assume that the number of nodes is such that $n^{1/3}$
is an integer. We view each node $v \in V$ as a three-tuple $v_1v_2v_3$ where $v_1, v_2, v_3 \in [n^{1/3}]$; for
concreteness, we may think that $v_1v_2v_3$ is the representation of $v$ as a three-digit number
in base-$n^{1/3}$.

For a matrix $S$ and index sets $U, W \subseteq V$, we use the notation $S[U, W]$ to refer to the
submatrix obtained by taking all rows $u$ with $u \in U$ and columns $w$ with $w \in W$. To easily
refer to specific subsets of indices, we use $*$ as a wild-card in this notation; specifically, we
use notation $x{*}{*} = \{v : v_1 = x\}$, ${*}x{*} = \{v : v_2 = x\}$ and ${*}{*}x = \{v : v_3 = x\}$. Finally, in
conjunction with this notation, we use the shorthand $*$ to denote the whole index set $V$
and $v$ to refer to a singleton set $\{v\}$. See Figure 1.

Overview. The distributed implementation of the school-book matrix multiplication we
present is known as the 3D algorithm. To illustrate why, we note that the element-wise
multiplications of the form

$$S_{uv} T_{vw}, \qquad u, v, w \in V,$$

which contribute to the entries $P_{uw} = \sum_{v \in V} S_{uv} T_{vw}$, can be viewed as points in the cube
$V \times V \times V$. To split the element-wise multiplications equally among the nodes, we
partition this cube into $n$ subcubes of size $n^{2/3} \times n^{2/3} \times n^{2/3}$.
Specifically, each node $v$ is assigned the subcube $v_1{*}{*} \times v_2{*}{*} \times v_3{*}{*}$, corresponding to the
multiplication task

$$S[v_1{*}{*}, v_2{*}{*}] \, T[v_2{*}{*}, v_3{*}{*}].$$

Algorithm description. The algorithm computes $n^{1/3}$ intermediate $n \times n$ matrices
$P^{(w)} = S[*, w{*}{*}] \, T[w{*}{*}, *]$ for $w \in [n^{1/3}]$, so that each node $v$ computes the block

$$P^{(v_2)}[v_1{*}{*}, v_3{*}{*}] = S[v_1{*}{*}, v_2{*}{*}] \, T[v_2{*}{*}, v_3{*}{*}].$$

Specifically, this is done as follows. 
Step 1: Distributing the entries. Each node $v \in V$ sends, for each node $u \in v_1{*}{*}$, the
submatrix $S[v, u_2{*}{*}]$ to node $u$, and for each node $w \in {*}v_1{*}$, the submatrix $T[v, w_3{*}{*}]$
to $w$. Each such submatrix has size $n^{2/3}$ and there are $2n^{2/3}$ recipients, for a total of
$2n^{4/3}$ messages per node.

Dually, each node $v \in V$ receives the submatrix $S[v_1{*}{*}, v_2{*}{*}]$ and the submatrix
$T[v_2{*}{*}, v_3{*}{*}]$. In particular, the submatrix $S[u, v_2{*}{*}]$ is received from the node $u$ for
$u \in v_1{*}{*}$, and the submatrix $T[w, v_3{*}{*}]$ is received from the node $w \in v_2{*}{*}$. In total,
each node receives $2n^{4/3}$ messages.

Step 2: Multiplication. Each node $v \in V$ computes the product of $S[v_1{*}{*}, v_2{*}{*}]$ and
$T[v_2{*}{*}, v_3{*}{*}]$ to get the $n^{2/3} \times n^{2/3}$ product matrix $P^{(v_2)}[v_1{*}{*}, v_3{*}{*}]$.

Step 3: Distributing the products. Each node $v \in V$ sends submatrix $P^{(v_2)}[u, v_3{*}{*}]$
to each node $u \in v_1{*}{*}$. Each such submatrix has size $n^{2/3}$ and there are $n^{2/3}$
recipients, for a total of $n^{4/3}$ messages per node.

Dually, each node $v \in V$ receives the submatrices $P^{(w)}[v, *]$ for each $w \in [n^{1/3}]$. In
particular, the submatrix $P^{(u_2)}[v, u_3{*}{*}]$ is received from the node $u \in v_1{*}{*}$. The total
number of received messages is $n^{4/3}$ per node.

Step 4: Assembling the product. Each node $v \in V$ computes the submatrix
$P[v, *] = \sum_{w \in [n^{1/3}]} P^{(w)}[v, *]$ of the product $P = ST$.

Analysis. The maximal number of messages sent or received in one of the above steps
is $O(n^{4/3})$. Moreover, the communication pattern clearly does not depend on the input
matrices, so the algorithm can be implemented in an oblivious way on the congested clique
using the routing scheme of Dolev et al. [24, Lemma 1]; the running time is $O(n^{1/3})$ rounds.
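The work partition of the 3D algorithm can be illustrated with a small sequential simulation. This is only a sketch under simplifying assumptions: nodes are numbered from 0, all communication is replaced by shared memory, and `q` plays the role of $n^{1/3}$. The helper names are hypothetical.

```python
def digits(v, q):
    """Three-digit base-q representation (v1, v2, v3) of node v in range(q**3)."""
    return (v // (q * q), (v // q) % q, v % q)


def first_digit_block(x, q):
    """Index set x** : all nodes whose first base-q digit equals x."""
    return range(x * q * q, (x + 1) * q * q)


def multiply_3d(S, T, q):
    """Simulate the 3D algorithm on n = q**3 nodes over the ring of integers."""
    n = q ** 3
    # Steps 1-2: node v obtains S[v1**, v2**] and T[v2**, v3**] and multiplies them.
    partial = {}
    for v in range(n):
        v1, v2, v3 = digits(v, q)
        rows, mid, cols = (first_digit_block(d, q) for d in (v1, v2, v3))
        partial[v] = {(u, w): sum(S[u][k] * T[k][w] for k in mid)
                      for u in rows for w in cols}
    # Steps 3-4: the owner of row u collects the partial rows and sums them.
    P = [[0] * n for _ in range(n)]
    for block in partial.values():
        for (u, w), value in block.items():
            P[u][w] += value
    return P


q = 2
n = q ** 3
S = [[(3 * i + j) % 5 for j in range(n)] for i in range(n)]
T = [[(i * j + 1) % 7 for j in range(n)] for i in range(n)]
P = multiply_3d(S, T, q)
```

Each simulated node handles a $q^2 \times q^2 \times q^2$ subcube, and every entry of $P$ is the sum of exactly $q$ partial results, one per value of the middle digit.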

2.2 Fast Matrix Multiplication 

Bilinear matrix multiplication. Consider a bilinear algorithm multiplying two $d \times d$
matrices using $m < d^3$ scalar multiplications, such as the Strassen algorithm [66]. Such an
algorithm computes the matrix product $P = ST$ by first computing $m$ linear combinations
of entries of both matrices,

$$\hat{S}^{(w)} = \sum_{(i,j) \in [d]^2} \alpha_{ijw} S_{ij} \quad\text{and}\quad \hat{T}^{(w)} = \sum_{(i,j) \in [d]^2} \beta_{ijw} T_{ij} \tag{1}$$

for each $w \in [m]$, then computing the products $\hat{P}^{(w)} = \hat{S}^{(w)} \hat{T}^{(w)}$ for $w \in [m]$, and finally
obtaining $P$ as

$$P_{ij} = \sum_{w \in [m]} \lambda_{ijw} \hat{P}^{(w)} \quad\text{for } (i,j) \in [d]^2, \tag{2}$$

where $\alpha_{ijw}$, $\beta_{ijw}$ and $\lambda_{ijw}$ are scalar constants that define the algorithm. In this section
we show that any bilinear matrix multiplication algorithm can be efficiently translated to
the congested clique model.
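As a concrete instance of a bilinear algorithm, the following sketch writes out the Strassen scheme for $d = 2$ with $m = 7$ products. The seven products correspond to the linear combinations in (1), and the four output entries to the recombination in (2). The function name is illustrative, and entries may be any ring elements supporting `+`, `-` and `*`.

```python
def strassen_2x2(S, T):
    """Multiply two 2x2 matrices with 7 multiplications instead of 8."""
    (s11, s12), (s21, s22) = S
    (t11, t12), (t21, t22) = T
    # Products of linear combinations of the inputs, as in (1).
    p1 = (s11 + s22) * (t11 + t22)
    p2 = (s21 + s22) * t11
    p3 = s11 * (t12 - t22)
    p4 = s22 * (t21 - t11)
    p5 = (s11 + s12) * t22
    p6 = (s21 - s11) * (t11 + t12)
    p7 = (s12 - s22) * (t21 + t22)
    # Linear combinations of the products, as in (2).
    return [[p1 + p4 - p5 + p7, p3 + p5],
            [p2 + p4, p1 - p2 + p3 + p6]]


product = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Because subtraction is used, the scheme needs a ring rather than a semiring. Applying it recursively to block matrices gives roughly $n^{\log_2 7}$ multiplications.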
Lemma 10. Let $R$ be a ring, and assume there exists a family of bilinear matrix multiplication
algorithms that can compute the product of $n \times n$ matrices with $O(n^{\sigma})$ multiplications. Then
matrix multiplication over $R$ can be computed in the congested clique in $O(n^{1-2/\sigma} \lceil b/\log n \rceil)$
rounds, where $b$ is the number of bits required for encoding a single element of $R$.

In particular for integers, rationals and their extensions, it is known that for any
constant $\varepsilon > 0$ there is a bilinear algorithm for matrix multiplication that uses $O(n^{\omega+\varepsilon})$
multiplications [17]; thus, the second part of Theorem 1 follows from the above lemma.

Preliminaries. Let us fix a bilinear algorithm that computes the product of $d \times d$ matrices
using $m(d) = O(d^{\sigma})$ scalar multiplications for any $d$, where $2 \le \sigma \le 3$. To multiply two
$n \times n$ matrices on a congested clique of $n$ nodes, fix $d$ so that $m(d) = n$, assuming for
convenience that $n$ is such that this is possible. Note that we have $d = \Theta(n^{1/\sigma})$.

Similarly with the semiring matrix multiplication, we view each node $v$ as a three-tuple
$v_1v_2v_3$, where we assume that $v_1 \in [d]$, $v_2 \in [n^{1/2}]$ and $v_3 \in [n^{1/2}/d]$; that is, $v_1v_2v_3$ can
be viewed as a mixed-radix representation of the integer $v$. This induces a partitioning of
the input matrices $S$ and $T$ into a two-level grid of submatrices; using the same wild-card
notation as before, $S$ is partitioned into a $d \times d$ grid of $n/d \times n/d$ submatrices $S[i{*}{*}, j{*}{*}]$
for $(i,j) \in [d]^2$, and each of these submatrices is further partitioned into an $n^{1/2} \times n^{1/2}$
grid of $n^{1/2}/d \times n^{1/2}/d$ submatrices $S[ix{*}, jy{*}]$ for $x, y \in [n^{1/2}]$. The other input matrix $T$
is partitioned similarly; see Figure 2.

Finally, we give each node $v \in V$ a unique secondary label $\ell(v) = x_1x_2 \in [n^{1/2}]^2$; again,
for concreteness we assume that $x_1x_2$ is the representation of $v$ in the base-$n^{1/2}$ system, so
this label can be computed from $v$ directly.

Overview. The basic idea of the fast distributed matrix multiplication is that we view
the matrices $S$ and $T$ as $d \times d$ matrices $S'$ and $T'$ over the ring of $n/d \times n/d$ matrices,
where

$$S'_{ij} = S[i{*}{*}, j{*}{*}], \qquad T'_{ij} = T[i{*}{*}, j{*}{*}], \qquad i, j \in [d],$$

which allows us to use (1) and (2) to compute the matrix product using the fixed bilinear
algorithm; specifically, this reduces the $n \times n$ matrix product into $n$ instances of
$n^{1-1/\sigma} \times n^{1-1/\sigma}$ matrix products, each of which is given to a different node. For the linear
combination steps, we use a partitioning scheme where each node $v$ with secondary label
$\ell(v) = x_1x_2$ is responsible for $n^{1/2}/d \times n^{1/2}/d$ blocks of the matrices involved in the computation.

Algorithm description. The algorithm computes the matrix product P = ST as 
follows. 

Step 1: Distributing the entries. Each node $v$ sends, for $x_2 \in [n^{1/2}]$, the submatrices
$S[v, {*}x_2{*}]$ and $T[v, {*}x_2{*}]$ to the node $u$ with label $\ell(u) = v_2x_2$. Each submatrix has
$n^{1/2}$ entries and there are $n^{1/2}$ recipients each receiving two submatrices, for a total
of $2n$ messages per node.

Dually, each node $u$ with label $\ell(u) = x_1x_2$ receives the submatrices $S[v, {*}x_2{*}]$ and
$T[v, {*}x_2{*}]$ from the nodes $v = v_1v_2v_3$ with $v_2 = x_1$. In particular, node $u$ now has the
submatrices $S[{*}x_1{*}, {*}x_2{*}]$ and $T[{*}x_1{*}, {*}x_2{*}]$. The total number of received messages
is $2n$ per node.


Figure 2: Fast matrix multiplication: partitioning schemes for matrix entries.


Step 2: Linear combination of entries. Each node $v$ with label $\ell(v) = x_1x_2$ computes
for $w \in V$ the linear combinations

$$\hat{S}^{(w)}[x_1{*}, x_2{*}] = \sum_{(i,j) \in [d]^2} \alpha_{ijw} S[ix_1{*}, jx_2{*}], \quad\text{and}$$

$$\hat{T}^{(w)}[x_1{*}, x_2{*}] = \sum_{(i,j) \in [d]^2} \beta_{ijw} T[ix_1{*}, jx_2{*}].$$

The computation is performed entirely locally.

Step 3: Distributing the linear combinations. Each node $v$ with label $\ell(v) = x_1x_2$
sends, for $w \in V$, the submatrices $\hat{S}^{(w)}[x_1{*}, x_2{*}]$ and $\hat{T}^{(w)}[x_1{*}, x_2{*}]$ to node $w$. Each
submatrix has $(n^{1/2}/d)^2 = O(n^{1-2/\sigma})$ entries and there are $n$ recipients each receiving
two submatrices, for a total of $O(n^{2-2/\sigma})$ messages per node.

Dually, each node $w$ receives the submatrices $\hat{S}^{(w)}[x_1{*}, x_2{*}]$ and $\hat{T}^{(w)}[x_1{*}, x_2{*}]$
from each node $v \in V$ with label $\ell(v) = x_1x_2$. Node $w$ now has the matrices $\hat{S}^{(w)}$ and
$\hat{T}^{(w)}$. The total number of received messages is $O(n^{2-2/\sigma})$ per node.
Step 4: Multiplication. Node $w \in V$ computes the product $\hat{P}^{(w)} = \hat{S}^{(w)} \hat{T}^{(w)}$. The
computation is performed entirely locally.

Step 5: Distributing the products. Each node $w$ sends, for $x_1, x_2 \in [n^{1/2}]$, the submatrix
$\hat{P}^{(w)}[x_1{*}, x_2{*}]$ to node $v$ with label $x_1x_2$. Each submatrix has $(n^{1/2}/d)^2 = O(n^{1-2/\sigma})$
entries and there are $n$ recipients, for a total of $O(n^{2-2/\sigma})$ messages sent
by each node.

Dually, each node $v \in V$ with label $\ell(v) = x_1x_2$ receives the submatrix $\hat{P}^{(w)}[x_1{*}, x_2{*}]$
from each node $w \in V$. The total number of received messages is $O(n^{2-2/\sigma})$ per
node.

Step 6: Linear combination of products. Each node $v \in V$ with label $\ell(v) = x_1x_2$
computes for $i, j \in [d]$ the linear combination

$$P[ix_1{*}, jx_2{*}] = \sum_{w \in V} \lambda_{ijw} \hat{P}^{(w)}[x_1{*}, x_2{*}].$$

Node $v \in V$ now has the submatrix $P[{*}x_1{*}, {*}x_2{*}]$. The computation is performed
entirely locally.

Step 7: Assembling the product. Each node $v \in V$ with label $\ell(v) = x_1x_2$ sends, for
each node $u \in V$ with $u_2 = x_1$, the submatrix $P[u, {*}x_2{*}]$ to the node $u$. Each
submatrix has $n^{1/2}$ entries and there are $n^{1/2}$ recipients, for a total of $n$ messages
sent by each node.

Dually, each node $u \in V$ receives the submatrix $P[u, {*}x_2{*}]$ from the node $v$ with
label $\ell(v) = u_2x_2$. Node $u$ now has the row $P[u, *]$ of the product matrix $P$. The
total number of received messages is $n$ per node.

Analysis. The maximal number of messages sent or received by a node in the above steps
is $O(n^{2-2/\sigma})$. Moreover, the communication pattern clearly does not depend on the input
matrices, so the algorithm can be implemented in an oblivious way on the congested clique
using the routing scheme of Dolev et al. [24, Lemma 1]; the running time is $O(n^{1-2/\sigma})$
rounds.


3 Upper Bounds 

3.1 Subgraph Detection and Counting

The subgraph detection and counting algorithms we present are mainly based on applying
the fast matrix multiplication to the adjacency matrix $A$ of a graph $G = (V, E)$, defined as

$$A_{uv} = \begin{cases} 1 & \text{if } (u, v) \in E, \\ 0 & \text{if } (u, v) \notin E, \end{cases}$$

where we assume that for undirected graphs edges $\{u, v\} \in E$ are oriented both ways.
Counting triangles and 4-cycles. For counting triangles, that is, 3-cycles, we use a
technique first observed by Itai and Rodeh [42]. That is, in an undirected graph with
adjacency matrix $A$, the number of triangles is known to be $\frac{1}{6}\operatorname{tr}(A^3)$, where the trace $\operatorname{tr}(S)$
of a matrix $S$ is the sum of its diagonal entries $S_{uu}$. Similarly, for directed graphs, the
number of triangles is $\frac{1}{3}\operatorname{tr}(A^3)$.

Alon et al. [6] generalise the above formula to counting undirected and directed $k$-cycles
for small $k$. For example, the number of 4-cycles in an undirected graph is given by

$$\frac{1}{8}\Bigl[\operatorname{tr}(A^4) - \sum_{v \in V}\bigl(2\deg(v)^2 - \deg(v)\bigr)\Bigr].$$

Likewise, if $G$ is a loopless directed graph and we denote for $v \in V$ by $\delta(v)$ the number of
nodes $u \in V$ such that $\{(u, v), (v, u)\} \subseteq E$, then the number of directed 4-cycles in $G$ is

$$\frac{1}{4}\Bigl[\operatorname{tr}(A^4) - \sum_{v \in V}\bigl(2\delta(v)^2 - \delta(v)\bigr)\Bigr].$$

Combining these observations with Theorem 1, we immediately obtain Corollary 2:

Corollary 2. For directed and undirected graphs, the number of triangles and 4-cycles can
be computed in $O(n^{\rho})$ rounds.

We note that similar trace formulas exist for counting $k$-cycles for $k \in \{5, 6, 7\}$,
requiring only computation of small powers of $A$ and local information. We omit the
detailed discussion of these in the context of the congested clique; see Alon et al. [6] for
details.
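The undirected trace formulas can be checked on small graphs with a centralised sketch such as the one below. It uses a plain cubic-time matrix product in place of the distributed routine, and the function names are hypothetical.

```python
def mat_mul(A, B):
    """Ordinary integer matrix product (stand-in for the distributed routine)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def count_triangles(A):
    """Triangles in an undirected graph: tr(A^3) / 6."""
    return trace(mat_mul(mat_mul(A, A), A)) // 6


def count_4cycles(A):
    """4-cycles in an undirected graph: (tr(A^4) - sum_v (2 deg(v)^2 - deg(v))) / 8."""
    A2 = mat_mul(A, A)
    degrees = [sum(row) for row in A]
    correction = sum(2 * d * d - d for d in degrees)
    return (trace(mat_mul(A2, A2)) - correction) // 8


# Complete graph on 4 nodes and the 4-cycle 0-1-2-3-0.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
C4 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
```

The correction term removes the closed 4-walks that backtrack along an edge or a path of length 2, each of which would otherwise be counted as a cycle.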

Detecting $k$-cycles. For detection of $k$-cycles we leverage the colour-coding techniques
of Alon et al. [5] in addition to the matrix multiplication. Again, the distributed algorithm
is a straightforward adaptation of a centralised one.

Fix a constant $k \in \mathbb{N}$. Let $c\colon V \to [k]$ be a labelling (or colouring) of the nodes by $k$
colours, such that node $v$ knows its colour $c(v)$; it should be stressed here that the colouring
need not be a proper colouring in the sense of the graph colouring problem. As a first
step, we consider the problem of finding a colourful $k$-cycle, that is, a $k$-cycle such that
each colour occurs exactly once on the cycle. We present the details assuming that the
graph $G$ is directed, but the technique works in an identical way for undirected graphs.

Lemma 11. Given a graph $G = (V, E)$ and a colouring $c\colon V \to [k]$, a colourful $k$-cycle
can be detected in $O(3^k n^{\rho})$ rounds.

Proof. For each subset of colours $X \subseteq [k]$, let $C^{(X)}$ be a Boolean matrix such that $C^{(X)}_{uv} = 1$
if there is a path of length $|X| - 1$ from $u$ to $v$ containing exactly one node of each colour
from $X$, and $C^{(X)}_{uv} = 0$ otherwise. For a singleton set $\{i\} \subseteq [k]$, the matrix $C^{(\{i\})}$ contains
1 only on the main diagonal, and only for nodes $v$ with $c(v) = i$; hence, node $v$ can locally
compute the row $v$ of the matrix $C^{(\{i\})}$ from its colour. For a non-singleton colour set $X$, we
have that

$$C^{(X)} = \bigvee_{\substack{Y \subseteq X \\ |Y| = \lceil |X|/2 \rceil}} C^{(Y)} A \, C^{(X \setminus Y)}, \tag{3}$$
where the products are computed over the Boolean semiring and $\vee$ denotes element-wise
logical or. Thus, we can compute $C^{(X)}$ for all $X \subseteq [k]$ by applying (3) recursively; there is
a colourful $k$-cycle in $G$ if and only if there is a pair of nodes $u, v \in V$ such that $C^{([k])}_{uv} = 1$
and $(v, u) \in E$.

To leverage fast matrix multiplication, we simply perform the operations stated in (3)
over the ring $\mathbb{Z}$ and observe that an entry of the resulting matrix is non-zero if and only
if the corresponding entry of $C^{(X)}$ is non-zero. The application of (3) needs two matrix
multiplications for each pair $(Y, X)$ with $Y \subseteq X \subseteq [k]$ and $|Y| = \lceil |X|/2 \rceil$. The number
of such pairs is bounded by $3^k$; to see this, note that the set $\{(Y, X) : Y \subseteq X \subseteq [k]\}$
can be identified with the set $\{0, 1, 2\}^k$ of trinary strings of length $k$ via the bijection
$w_1 w_2 \dots w_k \mapsto (\{i : w_i = 0\}, \{i : w_i \le 1\})$, and the set $\{0, 1, 2\}^k$ has size exactly $3^k$. Thus,
the total number of matrix multiplications used is at most $O(3^k)$. □
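The recursion (3) can be sketched centrally as follows, assuming a directed graph given as a 0/1 adjacency matrix and colours numbered from 0 to $k-1$ rather than 1 to $k$. Boolean products are computed naively here, and all names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def has_colourful_cycle(A, colour, k):
    """Return True if the directed graph A has a k-cycle using each colour exactly once."""
    n = len(A)

    def bool_mul(P, Q):
        return [[any(P[i][t] and Q[t][j] for t in range(n)) for j in range(n)]
                for i in range(n)]

    @lru_cache(maxsize=None)
    def C(X):
        """C(X)[u][v]: is there a path u -> v visiting each colour of X exactly once?"""
        if len(X) == 1:
            (i,) = X
            return tuple(tuple(u == v and colour[v] == i for v in range(n))
                         for u in range(n))
        result = [[False] * n for _ in range(n)]
        # Split X into a first half Y and the rest, joined by one edge (the factor A).
        for half in combinations(sorted(X), (len(X) + 1) // 2):
            Y = frozenset(half)
            term = bool_mul(bool_mul(C(Y), A), C(X - Y))
            result = [[result[u][v] or term[u][v] for v in range(n)]
                      for u in range(n)]
        return tuple(tuple(row) for row in result)

    full = C(frozenset(range(k)))
    # Close the colourful path u -> v with an edge v -> u.
    return any(full[u][v] and A[v][u] for u in range(n) for v in range(n))


cycle = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
found = has_colourful_cycle(cycle, [0, 1, 2, 3], 4)
```

Memoising `C` means every colour subset is processed once, which mirrors the counting argument over pairs $(Y, X)$ in the proof.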

We can now use Lemma 11 to prove Theorem 3; while we cannot directly construct a
suitable colouring from scratch for an uncoloured graph, we can try a number of colourings
that is exponential in $k$ to find a suitable one.

Theorem 3. For directed and undirected graphs, the existence of $k$-cycles can be detected
in $2^{O(k)} n^{\rho} \log n$ rounds.

Proof. To apply Lemma 11, we first have to obtain a colouring $c\colon V \to [k]$ that assigns
each colour once to at least one $k$-cycle in $G$, assuming that one exists. If we pick a colour
$c(v) \in [k]$ for each node uniformly at random, then for any $k$-cycle $C$ in $G$, the probability
that $C$ is colourful in the colouring $c$ is $k!/k^k > e^{-k}$. Thus, by picking $e^k \log n$ uniformly
random colourings and applying Lemma 11 to each of them, we find a $k$-cycle with high
probability if one exists.

This algorithm can also be derandomised using standard techniques. A $k$-perfect family
of hash functions $\mathcal{H}$ is a collection of functions $h\colon V \to [k]$ such that for each $U \subseteq V$ with
$|U| = k$, there is at least one $h \in \mathcal{H}$ such that $h$ assigns a distinct colour to each node in $U$.
There are known constructions that give such families $\mathcal{H}$ with $|\mathcal{H}| = 2^{O(k)} \log n$ and these
can be efficiently constructed [5]; thus, it suffices to take such an $\mathcal{H}$ and apply Lemma 11
for each colouring $h \in \mathcal{H}$. □

Detecting 4-cycles. We have seen how to count 4-cycles with the help of matrix
multiplication in $O(n^{\rho})$ rounds. We now show how to detect 4-cycles in $O(1)$ rounds.
The algorithm does not make direct use of matrix multiplication algorithms. However,
the key part of the algorithm can be interpreted as an efficient routine for sparse matrix
multiplication, under a specific definition of sparseness.

Let

$$P(X, Y, Z) = \{(x, y, z) : x \in X,\ y \in Y,\ z \in Z,\ \{x, y\} \in E,\ \{y, z\} \in E\}$$

consist of all distinct 2-walks (paths of length 2) from $X$ through $Y$ to $Z$. We will
again use the shorthand notation $v$ for $\{v\}$ and $*$ for $V$; for example, $P(x, *, *)$ consists of all
walks of length 2 from node $x$. There exists a 4-cycle if and only if $|P(x, *, z)| \ge 2$ for some
$x \ne z$.

At a high level, the algorithm proceeds as follows.

1. Each node $x$ computes $|P(x, *, *)|$. If $|P(x, *, *)| \ge 2n - 1$, then there has to be some
$z \ne x$ such that $|P(x, *, z)| \ge 2$, which implies that there exists a 4-cycle, and the
algorithm stops.

2. Otherwise, each node $x$ finds $P(x, *, *)$ and checks if there exists some $z \ne x$ such
that $|P(x, *, z)| \ge 2$.

The first phase is easy to implement in $O(1)$ rounds. The key idea is that if the algorithm
does not stop in the first phase, then the total volume of $P(*, *, *)$ is sufficiently small so
that we can afford to gather $P(x, *, *)$ for each node $x$ in $O(1)$ rounds.

We now present the algorithm in more detail. We write $N(x)$ for the neighbours of node
$x$. To implement the first phase, it is sufficient for each node $y$ to broadcast $\deg(y) = |N(y)|$
to all other nodes; we have

$$|P(x, *, *)| = \sum_{y \in N(x)} \deg(y).$$
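The two phases can be mirrored in a short centralised sketch. It ignores all communication aspects and only illustrates the counting logic; the function names are hypothetical, and `adj[x]` is assumed to be the neighbour set of node `x` in a simple undirected graph.

```python
def walk_count(adj, x):
    """|P(x, *, *)|: the number of 2-walks starting at x."""
    return sum(len(adj[y]) for y in adj[x])


def has_4_cycle(adj):
    n = len(adj)
    # Phase 1: at most n - 1 of the 2-walks from x return to x, so
    # 2n - 1 walks force two walks to the same endpoint z != x.
    for x in range(n):
        if walk_count(adj, x) >= 2 * n - 1:
            return True
    # Phase 2: every |P(x, *, *)| is now below 2n, so listing all 2-walks
    # costs fewer than 2n^2 steps in total.
    for x in range(n):
        seen = set()
        for y in adj[x]:
            for z in adj[y]:
                if z == x:
                    continue
                if z in seen:  # two distinct midpoints between x and z
                    return True
                seen.add(z)
    return False
```

For example, the call `has_4_cycle([{1, 3}, {0, 2}, {1, 3}, {0, 2}])` inspects a cycle on four nodes and reports a 4-cycle in the second phase.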

Now let us explain the second phase. Each node $y$ is already aware of $N(y)$ and hence
it can construct $P(*, y, *) = N(y) \times \{y\} \times N(y)$. Our goal is to distribute the set of all
2-walks

$$\bigcup_{y} P(*, y, *) = P(*, *, *) = \bigcup_{x} P(x, *, *)$$

so that each node $x$ will know $P(x, *, *)$.

In the second phase, we have

$$\sum_{y} \deg(y)^2 = \sum_{y} |P(*, y, *)| = \sum_{x} |P(x, *, *)| < 2n^2.$$

Using this bound, we obtain the following lemma. 

Lemma 12. It is possible to find sets $A(y)$ and $B(y)$ for each $y \in V$ such that the following
holds:

• $A(y) \subseteq V$, $B(y) \subseteq V$, and $|A(y)| = |B(y)| \ge \deg(y)/8$,

• the tiles $A(y) \times B(y)$ are disjoint subsets of the square $V \times V$.

Moreover, this can be done in $O(1)$ rounds in the congested clique.

Proof. Let $f(y)$ be $\deg(y)/4$ rounded down to the nearest power of 2, and let $k$ be $n$
rounded down to the nearest power of 2. We have $\sum_y f(y)^2 \le \sum_y \deg(y)^2/16 < n^2/8 < k^2$.
Now it is easy to place the tiles of dimensions $f(y) \times f(y)$ inside a square of dimensions
$k \times k$ without any overlap with the following iterative procedure:

• Before step $i = 1, 2, \dots$, we have partitioned the square into sub-squares of dimensions
$k/2^{i-1} \times k/2^{i-1}$, and each sub-square is either completely full or completely empty.

• During step $i$, we divide each sub-square into 4 parts, and fill empty squares with tiles
of dimensions $f(y) = k/2^i$.

• After step $i$, we have partitioned the square into sub-squares of dimensions $k/2^i \times k/2^i$,
and each sub-square is either completely full or completely empty.

Figure 3: 4-cycle detection: how $P(*, *, *)$ is partitioned among the nodes.

This way we have allocated disjoint tiles $A(y) \times B(y) \subseteq [k] \times [k] \subseteq V \times V$ for each $y$, with
$|A(y)| = |B(y)| = f(y) \ge \deg(y)/8$.

To implement this in the congested clique model, it is sufficient that each y broadcasts 
deg(y) to all other nodes, and then all nodes follow the above procedure to compute A{y) 
and B{y) locally. □ 
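One way to realise the iterative placement locally is sketched below. Instead of subdividing the square step by step, it sorts the tiles by decreasing side length and hands out consecutive blocks of cells in Z-order (Morton order), which yields the same kind of aligned, non-overlapping layout. This is a substitute formulation chosen for brevity, not the procedure from the proof verbatim; the function names are hypothetical, `sides[y]` plays the role of $f(y)$, and all positive sides as well as `k` are assumed to be powers of two.

```python
def morton_decode(z):
    """Split the interleaved bits of z into (row, col)."""
    row = col = 0
    bit = 0
    while z:
        col |= (z & 1) << bit
        row |= ((z >> 1) & 1) << bit
        z >>= 2
        bit += 1
    return row, col


def place_tiles(sides, k):
    """Place squares with power-of-two sides inside a k x k square.

    Returns a dict y -> (row, col, side) giving the top-left corner of
    the tile of y. Tiles with side 0 are skipped.
    """
    order = sorted(range(len(sides)), key=lambda y: -sides[y])
    placement = {}
    offset = 0  # next free cell index in Z-order
    for y in order:
        s = sides[y]
        if s == 0:
            continue
        # All earlier tiles are at least as large, so offset is a multiple
        # of s*s and the next s*s cells in Z-order form an aligned square.
        row, col = morton_decode(offset)
        placement[y] = (row, col, s)
        offset += s * s
    if offset > k * k:
        raise ValueError("tiles do not fit into the k x k square")
    return placement
```

Because every node runs the same deterministic procedure on the same list of degrees, all nodes arrive at the same placement without further communication.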

Now we will use the tiles $A(y) \times B(y)$ to implement the second phase of 4-cycle detection.
For convenience, we will use the following notation for each $y \in V$:

• The sets $N_A(y, a)$ where $a \in A(y)$ form a partition of $N(y)$ with $|N_A(y, a)| \le 8$.

• The sets $N_B(y, b)$ where $b \in B(y)$ form a partition of $N(y)$ with $|N_B(y, b)| \le 8$.

Note that we can assume that $A(y)$ and $B(y)$ are globally known by Lemma 12. Hence a
node can compute $N_A(y, a)$ and $N_B(y, b)$ if it knows $N(y)$.

With this notation, the algorithm proceeds as follows (see Figure 3):

1. For all $y \in V$ and $a \in A(y)$, node $y$ sends $N_A(y, a)$ to $a$.

This step can be implemented in $O(1)$ rounds.

2. For each $y$ and each pair $(a, b) \in A(y) \times B(y)$, node $a$ sends $N_A(y, a)$ to $b$.

Note that for each $(a, b)$ there is at most one $y$ such that $(a, b) \in A(y) \times B(y)$; hence
over each edge we send only $O(1)$ words. Therefore this step can be implemented in
$O(1)$ rounds.

3. At this point, each $b \in V$ has received a copy of $N(y)$ for all $y$ with $b \in B(y)$. Node
$b$ computes

$$W(y, b) = N(y) \times \{y\} \times N_B(y, b), \qquad W(b) = \bigcup_{y \,:\, b \in B(y)} W(y, b).$$

This is local computation; it takes 0 rounds.

We now give a lemma that captures the key properties of the algorithm.

Lemma 13. The sets $W(b)$ form a partition of $P(*, *, *)$. Moreover, for each $b$ we have
$|W(b)| = O(n)$.

Proof. For the first claim, observe that the sets $P(*, y, *)$ for $y \in V$ form a partition of
$P(*, *, *)$, the sets $W(y, b)$ for $b \in B(y)$ form a partition of $P(*, y, *)$, and each set $W(y, b)$
is part of exactly one $W(b)$.

For the second claim, let $Y$ consist of all $y \in V$ with $b \in B(y)$. As the tiles $A(y) \times B(y)$
are disjoint for all $y \in Y$, and all $y \in Y$ have the common value $b \in B(y)$, it has to hold
that the sets $A(y)$ are disjoint subsets of $V$ for all $y \in Y$. Therefore

$$\sum_{y \in Y} |N(y)| \le \sum_{y \in Y} 8\,|A(y)| \le 8\,|V| = 8n.$$

With $|N_B(y, b)| \le 8$ we get

$$|W(b)| = \sum_{y \in Y} |W(y, b)| \le \sum_{y \in Y} 8\,|N(y)| \le 64n. \qquad □$$

Now we are almost done: we have distributed $P(*, *, *)$ evenly among $V$ so that each
node only holds $O(n)$ elements. Finally, we use the dynamic routing scheme [46] to gather
$P(x, *, *)$ at each node $x \in V$; here each node needs to send $O(n)$ words and receive $O(n)$
words, and the running time is therefore $O(1)$ rounds. In conclusion, we can implement
both phases of 4-cycle detection in $O(1)$ rounds.

Theorem 4. The existence of 4-cycles can be detected in $O(1)$ rounds.

3.2 Girth 

Undirected girth. Recall that the girth $g$ of an undirected unweighted graph $G = (V, E)$
is the length of the shortest cycle in G. To compute the girth in the congested clique 
model, we leverage the fast cycle detection algorithm and the following lemma giving a 
trade-off between the girth and the number of edges. A similar approach of bounding from 
above the number of edges of a graph that contains no copies of some given subgraph was 
taken by Drucker et al. [25]. 

Lemma 14 ([53, pp. 362-363]). A graph with girth $g$ has at most $n^{1 + 1/\lfloor (g-1)/2 \rfloor} + n$ edges.

If the graph is dense, then by the above lemma it must have small girth and we can 
use fast cycle detection to compute it; otherwise, the graph is sparse and we can learn the 
complete graph structure. 

Theorem 15. For undirected graphs, the girth can be computed in $\tilde{O}(n^{\rho})$ rounds (or in
$n^{o(1)}$ rounds, if $\rho = 0$).

Proof. Assume for now that $\rho > 0$, and fix $\ell = \lceil 2 + 2/\rho \rceil$. Each node collects all graph
degrees and computes the total number of edges. If there are at most $n^{1 + 1/\lfloor \ell/2 \rfloor} + n =
O(n^{1+\rho})$ edges, we can collect full information about the graph structure to all nodes in
$O(n^{\rho})$ rounds using an algorithm of Dolev et al. [24], and each node can then compute the
girth locally.

Otherwise, by Lemma 14, the graph has girth at most $\ell$. Thus, for $k = 3, 4, \dots, \ell$, we
try to find a $k$-cycle using Theorem 3, in $\ell \cdot 2^{O(\ell)} n^{\rho} \log n = \tilde{O}(n^{\rho})$ rounds. When such a
cycle is found for some $k$, we stop and return $k$ as the girth.

Finally, if $\rho = 0$, we pick $\ell = \log \log n$, and both cases take $n^{o(1)}$ rounds. □


Directed girth. For a directed graph, the girth is defined as the length of the shortest 
directed cycle; the main difference is that directed girth can be 1 or 2. While the trade-off 
of Lemma 14 cannot be used for directed graphs, we can use a simpler technique of Itai 
and Rodeh [42]. 

Let $G = (V, E)$ be a directed graph; we can assume that there are no self-loops in $G$, as
otherwise the girth is 1 and we can detect this with local computation. Let $B^{(i)}$ be a Boolean
matrix defined as

$$B^{(i)}_{uv} = \begin{cases} 1 & \text{if there is a path of length } \ell \text{ from } u \text{ to } v \text{ for some } 1 \le \ell \le i, \\ 0 & \text{otherwise.} \end{cases}$$

Clearly, we have that $B^{(1)} = A$. Moreover, if $i = j + k$, we have

$$B^{(i)} = \bigl(B^{(j)} B^{(k)}\bigr) \lor A, \qquad (4)$$

where the matrix product is over the Boolean semiring and $\lor$ denotes element-wise logical
or.

Corollary 16. For directed graphs, the girth can be computed in $\tilde{O}(n^{\rho})$ rounds.

Proof. It suffices to find the smallest $i$ such that there is $v \in V$ with $B^{(i)}_{vv} = 1$; clearly $i$ is
then the girth of graph $G$. We first compute $A = B^{(1)}, B^{(2)}, B^{(4)}, B^{(8)}, \dots$ using (4) with
$j = k = i/2$ until we find $i$ such that $B^{(i)}_{vv} = 1$ for some $v \in V$. We then know that the
girth is between $i/2$ and $i$; we can perform binary search on this interval to find the girth,
using (4) to evaluate the intermediate matrices. This requires $O(\log n)$ calls to the matrix
multiplication algorithm. □
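The doubling and binary search steps can be sketched as follows. The Boolean product is computed naively here, where the distributed algorithm would call the fast matrix multiplication routine; the names `bool_product` and `directed_girth` are hypothetical, and `A` is assumed to be a 0/1 adjacency matrix given as a list of lists.

```python
def bool_product(X, Y):
    """Product of two Boolean matrices over the Boolean semiring."""
    n = len(X)
    return [[any(X[u][w] and Y[w][v] for w in range(n)) for v in range(n)]
            for u in range(n)]


def directed_girth(A):
    """Length of a shortest directed cycle, or None if the graph is acyclic."""
    n = len(A)
    A = [[bool(a) for a in row] for row in A]
    cache = {1: A}  # cache[i] stores B^(i)

    def B(i):
        if i not in cache:
            j = i // 2
            k = i - j
            prod = bool_product(B(j), B(k))
            # Equation (4): B^(j+k) = B^(j) B^(k) OR A.
            cache[i] = [[prod[u][v] or A[u][v] for v in range(n)]
                        for u in range(n)]
        return cache[i]

    def has_cycle(i):
        M = B(i)
        return any(M[v][v] for v in range(n))

    # Doubling phase: find the first power of two i with a closed walk
    # of length at most i.
    i = 1
    while not has_cycle(i):
        if i >= n:
            return None  # a directed cycle has length at most n
        i *= 2
    # Binary search phase: the girth lies in (i/2, i].
    lo, hi = i // 2 + 1, i
    while lo < hi:
        mid = (lo + hi) // 2
        if has_cycle(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Thanks to the cache, each evaluation of an intermediate matrix reuses previously computed matrices, so the number of products stays logarithmic per probe.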


3.3 Routing and Shortest Paths 

In this section, we present algorithms for variants of the all-pairs shortest paths (APSP)
problem. In the congested clique model, the local input for a node $u \in V$ in the APSP
problem is a vector containing the local edge weights $W(u, v)$ for $v \in V$. The output for
$u \in V$ is the actual shortest path distances $d(u, v)$ for each other node $v \in V$, along with
the routing table entries $R[u, v]$, where each entry $R[u, v] = w \in V$ is a node such that
$(u, w) \in E$ and $w$ lies on a shortest path from $u$ to $v$. For convenience, we use the same
notation for directed and undirected graphs, assume $W(u, v) = \infty$ if $(u, v) \notin E$, and for
unweighted graphs, we set $W(u, v) = 1$ for each $(u, v) \in E$.

For a graph $G = (V, E)$ with edge weights $W$, we define the weight matrix $W$ as

$$W_{uv} = \begin{cases} W(u, v) & \text{if } u \ne v, \\ 0 & \text{if } u = v. \end{cases}$$

Our APSP algorithms are mostly based on the manipulation of the weight matrix $W$ and
the adjacency matrix $A$, as defined in Section 3.1.

Distance product and iterated squaring. Matrix multiplication can be used to
compute the shortest path distances via iterated squaring of the weight matrix over the
min-plus semiring [27, 32, 56]. That is, the matrix product is the distance product, also
known as the min-plus product or tropical product, defined as

$$(S \star T)_{uv} = \min_{w} \bigl( S_{uw} + T_{wv} \bigr).$$

Given a graph $G = (V, E)$ with weight matrix $W$, the distance product power $W^n$
gives the actual distances in $G$ as $d(v, u) = W^n_{vu}$. Computing $W^n$ can be done with $\lceil \log n \rceil$
distance products by iteratively squaring $W$, that is, we compute

$$W^2 = W \star W, \quad W^4 = W^2 \star W^2, \quad \dots, \quad W^n = W^{n/2} \star W^{n/2}.$$

Combining this observation with the semiring algorithm from Theorem 1, we immediately 
obtain a simple APSP algorithm for the congested clique. 

Corollary 6. For weighted, directed graphs with integer weights in $\{0, \pm 1, \dots, \pm M\}$, all-pairs
shortest paths can be computed in $O(n^{1/3} \log n \lceil \log M / \log n \rceil)$ communication rounds.
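For illustration, iterated squaring over the min-plus semiring can be written as the following centralised sketch. The naive cubic product stands in for the distributed semiring algorithm of Theorem 1, the function names are hypothetical, and we assume that the diagonal of `W` is zero and that the graph has no negative cycles.

```python
INF = float("inf")


def distance_product(S, T):
    """Min-plus product: (S * T)[u][v] = min over w of S[u][w] + T[w][v]."""
    n = len(S)
    return [[min(S[u][w] + T[w][v] for w in range(n)) for v in range(n)]
            for u in range(n)]


def apsp_iterated_squaring(W):
    """All-pairs distances via about log2(n) squarings of the weight matrix."""
    n = len(W)
    D = [row[:] for row in W]
    length = 1  # D currently accounts for paths with at most `length` edges
    while length < n:
        D = distance_product(D, D)
        length *= 2
    return D


if __name__ == "__main__":
    W = [[0, 3, INF],
         [INF, 0, 1],
         [2, INF, 0]]
    print(apsp_iterated_squaring(W))  # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```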

The subsequent APSP algorithms we discuss in this section are, for the most part, 
similarly based on the iterated squaring of the weight matrix; the main difference is that 
we replace the semiring matrix multiplication with distance product algorithms derived 
from the fast matrix multiplication algorithm. 

Constructing routing tables. The iterated squaring algorithm of Corollary 6 can be
adapted to also compute a routing table $R$ as follows. Assume that our distance product
algorithm also provides for the distance product $S \star T$ a witness matrix $Q$ such that if
$Q_{uv} = w$, then $(S \star T)_{uv} = S_{uw} + T_{wv}$. With this information, we can compute the
routing table $R$ during the iterated squaring algorithm; when we compute the product
$W^{2i} = W^i \star W^i$, we also obtain a witness matrix $Q$, and update the routing table by setting

$$R[u, v] = R[u, Q_{uv}]$$

for each $u, v \in V$ with $W^{2i}_{uv} < W^i_{uv}$.
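The routing table update can be sketched in the same centralised style. The witness-producing product below is the naive one, the function names are hypothetical, and we assume positive edge weights with a zero diagonal so that the first hop is well defined.

```python
INF = float("inf")


def distance_product_with_witness(S, T):
    """Return (P, Q) where P = S * T and P[u][v] = S[u][Q[u][v]] + T[Q[u][v]][v]
    whenever P[u][v] is finite."""
    n = len(S)
    P = [[INF] * n for _ in range(n)]
    Q = [[None] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            for w in range(n):
                if S[u][w] + T[w][v] < P[u][v]:
                    P[u][v] = S[u][w] + T[w][v]
                    Q[u][v] = w
    return P, Q


def apsp_with_routing(W):
    """Distances D and routing table R, where R[u][v] is the first hop
    on a shortest path from u to v (None if u == v or v is unreachable)."""
    n = len(W)
    D = [row[:] for row in W]
    # Initially only direct edges are known, and the first hop is v itself.
    R = [[v if u != v and W[u][v] < INF else None for v in range(n)]
         for u in range(n)]
    length = 1
    while length < n:
        P, Q = distance_product_with_witness(D, D)
        R_new = [row[:] for row in R]
        for u in range(n):
            for v in range(n):
                if P[u][v] < D[u][v]:
                    # The improved path goes through Q[u][v]; route towards it.
                    R_new[u][v] = R[u][Q[u][v]]
        D, R = P, R_new
        length *= 2
    return D, R
```

With the weight matrix `[[0, 3, INF], [INF, 0, 1], [2, INF, 0]]`, node 0 reaches node 2 only via node 1, so the sketch sets `R[0][2]` to 1.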

The semiring matrix multiplication can be easily modified to produce witnesses, but 
for the subsequent distance product algorithms based on fast matrix multiplication this 
is not directly possible. However, we can apply known techniques from the centralised 
setting to obtain witnesses also in these cases [4, 65, 76]; we refer to Section 3.4 for details. 

Unweighted undirected APSP. In the case of unweighted undirected graphs, we
can obtain exact all-pairs shortest paths via a technique of Seidel [65]. Specifically, let
$G = (V, E)$ be an unweighted undirected graph with adjacency matrix $A$; the $k$th power $G^k$ of
$G$ is a graph with node set $V$ and edge set $\{\{u, v\} : d(u, v) \le k\}$. In particular, the square
graph $G^2$ can be constructed in $O(n^{\rho})$ rounds from $G$, as the adjacency matrix of $G^2$ is
$A^2 \lor A$, where the product is over the Boolean semiring and $\lor$ denotes element-wise logical
or.

The following lemma of Seidel allows us to compute distances in $G$ if we already know
distances in the square graph $G^2$; to avoid ambiguity, we write in this subsection $d_G(u, v)$
for the distances in a graph $G$.

Lemma 17 ([65]). Let $G = (V, E)$ be an unweighted undirected graph with adjacency matrix
$A$, and let $D$ be a distance matrix for $G^2$, that is, a matrix with the entries $D_{uv} = d_{G^2}(u, v)$.
Let $S = DA$, where the product is computed over integers. We have that

$$d_G(u, v) = \begin{cases} 2 d_{G^2}(u, v) & \text{if } S_{uv} \ge d_{G^2}(u, v) \deg_G(v), \text{ and} \\ 2 d_{G^2}(u, v) - 1 & \text{if } S_{uv} < d_{G^2}(u, v) \deg_G(v). \end{cases}$$


We can now recover all-pairs shortest distances in an undirected unweighted graph by 
recursively applying Lemma 17. 

Corollary 7. For undirected, unweighted graphs, all-pairs shortest paths can be computed
in $\tilde{O}(n^{\rho})$ rounds.

Proof. Let $G = (V, E)$ be an unweighted undirected graph with adjacency matrix $A$. We
first compute the adjacency matrix for $G^2$; as noted above, this can be done in $O(n^{\rho})$
rounds. There are now two cases to consider.

1. If $G = G^2$, then $d_G(u, v) = 1$ if $u$ and $v$ are adjacent in $G$, and $d_G(u, v) = \infty$
otherwise; thus, we are done.

2. Otherwise, we compute all-pairs shortest path distances in the graph $G^2$; since
we have already constructed the adjacency matrix for $G^2$, we can do the distance
computation in $G^2$ by recursively calling this algorithm with input graph $G^2$. Then,
we construct the matrix $D$ with entries $D_{uv} = d_{G^2}(u, v)$ as in Lemma 17 and compute
$S = DA$. We can recover distances in $G$ using Lemma 17, as each node can transmit
its degree in $G$ to each other node in a single round and then check the conditions
of the lemma locally.

The recursion terminates in $O(\log n)$ calls, as the graph $G^n$ consists of disjoint cliques. □
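The recursion of Corollary 7 can be traced with the following centralised sketch. Both matrix products are computed naively, the function name `seidel` is ours, and `A` is assumed to be a symmetric 0/1 matrix with a zero diagonal. Pairs in different components keep the distance `INF`; this is handled by skipping them explicitly, which is a convenience of the sketch.

```python
INF = float("inf")


def seidel(A):
    """All-pairs distances in an unweighted undirected graph via Lemma 17."""
    n = len(A)
    # Adjacency matrix of the square graph: A^2 OR A, without self-loops.
    A2 = [[1 if u != v and (A[u][v] or any(A[u][w] and A[w][v]
                                           for w in range(n))) else 0
           for v in range(n)] for u in range(n)]
    if A2 == A:
        # Every component is a clique: distances are 0, 1, or infinity.
        return [[0 if u == v else (1 if A[u][v] else INF) for v in range(n)]
                for u in range(n)]
    D2 = seidel(A2)  # distances in the square graph
    deg = [sum(row) for row in A]
    D = [[INF] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if D2[u][v] == INF:
                continue  # u and v lie in different components
            # S[u][v] = (D2 A)[u][v]: sum of D2[u][w] over neighbours w of v.
            s = sum(D2[u][w] for w in range(n) if A[w][v])
            if s >= D2[u][v] * deg[v]:
                D[u][v] = 2 * D2[u][v]
            else:
                D[u][v] = 2 * D2[u][v] - 1
    return D


if __name__ == "__main__":
    path = [[0, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 1, 0, 1],
            [0, 0, 1, 0]]
    print(seidel(path))
```

On the path with four nodes, the recursion passes through the square graph and the complete graph before the distances are reconstructed level by level.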


Weighted APSP with small weights. By embedding the distance product of two 
matrices into a suitable ring, we can use fast ring matrix multiplication to compute all-pairs 
shortest distances [74]; however, this is only practical for very small weights, as the ring 
embedding exponentially increases the amount of bits required to transmit the matrix 
entries. The following lemma encapsulates this idea. 

Lemma 18. Given $n \times n$ matrices $S$ and $T$ with entries in $\{0, 1, \dots, M\} \cup \{\infty\}$, we can
compute the distance product $S \star T$ in $O(M n^{\rho})$ rounds.

Proof. We construct matrices $S^*$ and $T^*$ by replacing each matrix entry $w$ with $X^w$,
where $X$ is a formal variable; values $\infty$ are replaced by 0. We then compute the product
$S^* \cdot T^*$ over the polynomial ring $\mathbb{Z}[X]$; all polynomials involved in the computation have
degree at most $2M$ and their coefficients are integers of absolute value at most $n^{O(1)}$, so
this computation can be done in $O(M n^{\rho})$ rounds. Finally, we can recover each matrix
entry $(S \star T)_{uv}$ in the original distance product by taking the degree of the lowest-degree
monomial in $(S^* \cdot T^*)_{uv}$. □
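The embedding can be made concrete with a small sketch in which each polynomial is stored as a list of coefficients. The ring product is evaluated entry by entry with the naive method, where the actual algorithm would use fast matrix multiplication over $\mathbb{Z}[X]$; the function name is hypothetical.

```python
INF = float("inf")


def distance_product_via_polynomials(S, T, M):
    """Distance product of matrices with entries in {0, ..., M} or INF,
    obtained from the lowest-degree monomial of the polynomial product."""
    n = len(S)
    P = [[INF] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            # coeffs[d] is the coefficient of X^d in (S* . T*)[u][v].
            coeffs = [0] * (2 * M + 1)
            for w in range(n):
                if S[u][w] != INF and T[w][v] != INF:
                    # X^a * X^b = X^(a + b); infinite entries map to 0.
                    coeffs[S[u][w] + T[w][v]] += 1
            for degree, c in enumerate(coeffs):
                if c:
                    P[u][v] = degree
                    break
    return P
```

Each coefficient counts the indices $w$ that attain a given sum, so it never exceeds $n$ in the final product.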


Using iterated squaring in combination with Lemma 18, we can compute all-pairs
shortest paths up to a small distance $M$ quickly; that is, we want to compute a matrix $B$
such that

$$B_{uv} = \begin{cases} d(u, v) & \text{if } d(u, v) \le M, \text{ and} \\ \infty & \text{if } d(u, v) > M. \end{cases}$$

This can be done by replacing all weights over $M$ with $\infty$ before each squaring operation
to ensure that we do not operate with too large values, giving us the following lemma.


Lemma 19. Given a directed, weighted graph with non-negative integer weights, we can
compute all-pairs shortest paths up to distance $M$ in $\tilde{O}(M n^{\rho})$ rounds.

The above lemma can be used to compute all-pairs shortest paths quickly assuming 
that the weighted diameter of the graph is small; recall that the weighted diameter of a 
weighted graph is the maximum distance between any pair of nodes. 


Corollary 8. For directed graphs with positive integer weights and weighted diameter $U$,
all-pairs shortest paths can be computed in $\tilde{O}(U n^{\rho})$ rounds.

Proof. If we know that the weighted diameter is U, we can simply apply Lemma 19 with 
M = U. However, if we do not know U beforehand, we can (1) first compute the reachability 
matrix of the graph from the unweighted adjacency matrix, (2) guess U = 1 and compute 
all-pairs shortest paths up to distance U, and (3) check if we obtained distances for all 
pairs that are reachable according to the reachability matrix; if not, then we double our 
guess for U and repeat steps (2) and (3). □ 
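The guess-and-double strategy from the proof can be sketched as follows. The capped product is the naive min-plus product with large values replaced by infinity, standing in for Lemma 18; all function names are hypothetical, and non-negative integer weights with a zero diagonal are assumed.

```python
INF = float("inf")


def capped_distance_product(S, T, M):
    """Min-plus product in which every value above M is replaced by INF."""
    n = len(S)
    P = [[min(S[u][w] + T[w][v] for w in range(n)) for v in range(n)]
         for u in range(n)]
    return [[p if p <= M else INF for p in row] for row in P]


def apsp_up_to(W, M):
    """Distances d(u, v) for pairs with d(u, v) <= M, and INF otherwise."""
    n = len(W)
    D = [[w if w <= M else INF for w in row] for row in W]
    length = 1
    while length < n:
        D = capped_distance_product(D, D, M)
        length *= 2
    return D


def apsp_unknown_diameter(W):
    n = len(W)
    # Step (1): reachability from the unweighted adjacency structure.
    unit = [[0 if u == v else (1 if W[u][v] < INF else INF)
             for v in range(n)] for u in range(n)]
    hops = apsp_up_to(unit, n)
    # Steps (2) and (3): double the guess until all reachable pairs
    # have received a finite distance.
    U = 1
    while True:
        D = apsp_up_to(W, U)
        if all((D[u][v] < INF) == (hops[u][v] < INF)
               for u in range(n) for v in range(n)):
            return D
        U *= 2
```

The final guess is less than twice the weighted diameter, so the total cost is dominated by the last call.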


Approximate weighted APSP. We can leverage the above result and a rounding
technique to obtain a fast $(1 + o(1))$-approximation algorithm for the weighted directed
APSP problem. Similar rounding-based approaches were previously used by Zwick [76] in
a centralised setting and by Nanongkai [57] in the distributed setting; however, the idea
can be traced back much further [64].

We first consider the computation of a $(1 + \delta)$-approximate distance product over
integers for a given $\delta > 0$; the following lemma is an analogue of one given by Zwick [76]
in a centralised setting.

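Lemma 20. Given $n \times n$ matrices $S$ and $T$ with entries in $\{0, 1, \dots, M\} \cup \{\infty\}$, we can
compute a matrix $\tilde{P}$ satisfying

$$P_{uv} \le \tilde{P}_{uv} \le (1 + \delta) P_{uv} \quad \text{for } u, v \in V,$$

where $P = S \star T$ is the distance product of $S$ and $T$, in $O(n^{\rho} (\log_{1+\delta} M) / \delta)$ rounds.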

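Proof. For $i \in \{0, \dots, \lceil \log_{1+\delta} M \rceil\}$, let $S^{(i)}$ be the matrix defined as

$$S^{(i)}_{uv} = \begin{cases} \lceil S_{uv} / (1 + \delta)^i \rceil & \text{if } S_{uv} \le 2(1 + \delta)^{i+1}/\delta, \text{ and} \\ \infty & \text{otherwise,} \end{cases}$$

and let $T^{(i)}$ be defined similarly for $T$. Furthermore, let us define $P^{(i)} = S^{(i)} \star T^{(i)}$. We
now claim that selecting

$$\tilde{P}_{uv} = \min_{i} \bigl\{ \lfloor (1 + \delta)^i P^{(i)}_{uv} \rfloor \bigr\}$$

gives a matrix $\tilde{P}$ with the desired properties.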

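It follows directly from the definitions that $P_{uv} \le \tilde{P}_{uv}$, so it remains to prove the other
inequality. Thus, let us fix $u, v \in V$, and let $w \in V$ be such that

$$P_{uv} = S_{uw} + T_{wv}.$$

Let $j = \lfloor \log_{1+\delta}(\delta P_{uv}/2) \rfloor$. The choice of $j$ means that $P_{uv} \le 2(1 + \delta)^{j+1}/\delta$;
since $S_{uw}$ and $T_{wv}$ are bounded from above by $P_{uv}$, the entries $S^{(j)}_{uw}$ and $T^{(j)}_{wv}$ are finite.
Furthermore, we have

$$(1 + \delta)^j S^{(j)}_{uw} \le S_{uw} + (1 + \delta)^j, \qquad (1 + \delta)^j T^{(j)}_{wv} \le T_{wv} + (1 + \delta)^j,$$

and therefore

$$(1 + \delta)^j P^{(j)}_{uv} \le (1 + \delta)^j \bigl( S^{(j)}_{uw} + T^{(j)}_{wv} \bigr) \le S_{uw} + T_{wv} + 2(1 + \delta)^j \le P_{uv} + \delta P_{uv} = (1 + \delta) P_{uv}.$$

Finally, we have $\tilde{P}_{uv} \le \lfloor (1 + \delta)^j P^{(j)}_{uv} \rfloor \le (1 + \delta) P_{uv}$.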

To see that we can compute the matrix $\tilde{P}$ in the claimed time, we first note that
each of the matrices $S^{(i)}$ and $T^{(i)}$ can be constructed locally by the nodes. The product
$P^{(i)} = S^{(i)} \star T^{(i)}$ can be computed in $O(n^{\rho}/\delta)$ rounds for a single index $i$ by Lemma 18, as
the entries of $S^{(i)}$ and $T^{(i)}$ are integers bounded from above by $O(1/\delta)$; this is repeated for
each index $i$, and the number of iterations is thus $O(\log_{1+\delta} M)$. Finally, the matrix $\tilde{P}$ can
be constructed from matrices $P^{(i)}$ locally. □
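The scaling construction can be tried out with exact rational arithmetic. In the sketch below, the naive min-plus product replaces the small-weight algorithm of Lemma 18, the function names are hypothetical, and we assume $0 < \delta \le 1$ and integer entries in $\{0, \dots, M\}$ or `INF`.

```python
from fractions import Fraction
from math import ceil, floor

INF = float("inf")


def distance_product(S, T):
    n = len(S)
    return [[min(S[u][w] + T[w][v] for w in range(n)) for v in range(n)]
            for u in range(n)]


def scale(S, delta, i):
    """The matrix S^(i): entries are divided by (1 + delta)^i and rounded up;
    entries above 2 (1 + delta)^(i+1) / delta are replaced by INF."""
    factor = (1 + delta) ** i
    limit = 2 * (1 + delta) ** (i + 1) / delta
    return [[ceil(s / factor) if s != INF and s <= limit else INF
             for s in row] for row in S]


def approx_distance_product(S, T, M, delta):
    delta = Fraction(delta)
    n = len(S)
    levels = 0
    while (1 + delta) ** levels < M:  # levels = ceil(log_{1+delta} M)
        levels += 1
    best = [[INF] * n for _ in range(n)]
    for i in range(levels + 1):
        Pi = distance_product(scale(S, delta, i), scale(T, delta, i))
        factor = (1 + delta) ** i
        for u in range(n):
            for v in range(n):
                if Pi[u][v] != INF:
                    best[u][v] = min(best[u][v], floor(factor * Pi[u][v]))
    return best
```

Every scaled matrix has entries of size $O(1/\delta)$, which is what makes the small-weight product of Lemma 18 applicable at each level.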


Using Lemma 20, we obtain a $(1 + o(1))$-approximate APSP algorithm.

Theorem 9. For directed graphs with integer weights in $\{0, 1, \dots, 2^{n^{o(1)}}\}$, we can compute
$(1 + o(1))$-approximate all-pairs shortest paths in $O(n^{\rho + o(1)})$ rounds.

Proof. Let $G = (V, E)$ be a directed weighted graph with edge weights in $\{0, 1, \dots, M\}$,
where $M = 2^{n^{o(1)}}$. To compute the approximate shortest paths, we apply iterated squaring
over the min-plus semiring to the weight matrix $W$ of $G$, but use the approximate distance
product algorithm of Lemma 20 to compute the products. After $\lceil \log n \rceil$ iterations, we
obtain a matrix $D$; by induction we have

$$d(u, v) \le D_{uv} \le (1 + \delta)^{\lceil \log n \rceil} d(u, v) \quad \text{for } u, v \in V.$$

Selecting $\delta = o(1/\log n)$, this gives a $(1 + o(1))$-approximation for the shortest distances.

To analyse the running time, we observe that we call the algorithm of Lemma 20 $\lceil \log n \rceil$
times; as the maximum distance between nodes in $G$ is $nM = 2^{n^{o(1)}}$, the running time of
each call is bounded by

$$O\!\left( \frac{n^{\rho} \log_{1+\delta}(nM)}{\delta} \right) = O\!\left( \frac{n^{\rho + o(1)}}{\delta \log(1 + \delta)} \right).$$

For sufficiently small $\delta$, we have $1/(\delta \log(1 + \delta)) = O(1/\delta^2)$. Thus, for, e.g., $\delta = 1/\log^2 n =
o(1/\log n)$, the total running time is $O(n^{\rho + o(1)})$, as the polylogarithmic factors are subsumed
by $n^{o(1)}$. □


3.4 Witness Detection for Distance Product 

Witness problem for the distance product. As noted in Section 3.3, to recover the
routing table in the APSP algorithms based on fast matrix multiplication in addition to
computing the shortest path lengths, we need the ability to compute a witness matrix for
the distance product $S \star T$. That is, we need to find a matrix $Q$ such that if $Q_{uv} = w$,
then $(S \star T)_{uv} = S_{uw} + T_{wv}$; in this case, the index $w$ is called a witness for the pair $(u, v)$.

While one can easily modify the semiring matrix multiplication algorithm to provide 
witnesses, this is not directly possible with the fast matrix multiplication algorithms. 
However, known techniques from centralised algorithms [4, 65, 76] can be adapted to the 
congested clique to bridge this gap. 

Lemma 21. If we can compute the distance product for two $n \times n$ matrices $S$ and $T$ in
$M$ rounds, we can also find a witness matrix for $S \star T$ in $M \operatorname{polylog}(n)$ rounds.

The rest of this section outlines the proof of this lemma. While we have stated it for the 
distance product, it should be noted that the same techniques also work for the Boolean 
semiring matrix product. 


Preliminaries. For a matrix $S$ and index subsets $U, W \subseteq V$, we define the matrix $S(U, W)$
as

$$S(U, W)_{uw} = \begin{cases} S_{uw} & \text{if } u \in U \text{ and } w \in W, \\ \infty & \text{otherwise.} \end{cases}$$

That is, we set all rows and columns not indexed by $U$ and $W$ to $\infty$. As before, we use $*$
as a shorthand for the whole index set $V$.


Finding unique witnesses. As a first step, we compute witnesses for all $(u, v)$ that
have a unique witness, that is, there is exactly one index $w$ such that $(S \star T)[u, v] =
S[u, w] + T[w, v]$. To construct a candidate witness matrix $Q$, let $V^{(i)} \subseteq V$ be the set of
indices $v$ such that bit $i$ in the binary presentation of $v$ is 1. For $i = 1, 2, \dots, \lceil \log n \rceil$, we
compute the distance product $P^{(i)} = S(*, V^{(i)}) \star T(V^{(i)}, *)$. If $P^{(i)}_{uv} = (S \star T)_{uv}$, then we set the
$i$th bit of $Q_{uv}$ to 1, and otherwise we set it to 0.

If there is a unique witness for $(u, v)$, then $Q_{uv}$ is correct, and we can check if the
candidate witness $Q_{uv} = w$ is correct by computing $S_{uw} + T_{wv}$. The algorithm clearly uses
$O(\log n)$ matrix multiplications.
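The bit-by-bit recovery of unique witnesses can be sketched as follows. Indices and bit positions are counted from 0 here, the naive distance product is used as a black box, and the function names are hypothetical.

```python
INF = float("inf")


def distance_product(S, T):
    n = len(S)
    return [[min(S[u][w] + T[w][v] for w in range(n)) for v in range(n)]
            for u in range(n)]


def restrict(S, rows, cols):
    """The matrix S(rows, cols): entries outside rows x cols become INF."""
    n = len(S)
    return [[S[u][w] if u in rows and w in cols else INF for w in range(n)]
            for u in range(n)]


def unique_witnesses(S, T):
    """Return (P, Q) with P = S * T. Q[u][v] is a verified witness or None;
    it is guaranteed to be set whenever (u, v) has exactly one witness."""
    n = len(S)
    everything = set(range(n))
    P = distance_product(S, T)
    Q = [[0] * n for _ in range(n)]
    for i in range(max(1, (n - 1).bit_length())):
        Vi = {w for w in range(n) if (w >> i) & 1}
        Pi = distance_product(restrict(S, everything, Vi),
                              restrict(T, Vi, everything))
        for u in range(n):
            for v in range(n):
                # The optimum survives the restriction iff some witness
                # has bit i set.
                if P[u][v] != INF and Pi[u][v] == P[u][v]:
                    Q[u][v] |= 1 << i
    for u in range(n):
        for v in range(n):
            w = Q[u][v]
            if P[u][v] == INF or w >= n or S[u][w] + T[w][v] != P[u][v]:
                Q[u][v] = None  # candidate failed the check
    return P, Q
```

When several witnesses exist, the assembled bits may mix different indices, which is why each candidate is verified before it is kept.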

Finding witnesses in the general case. To find witnesses for all indices {u,v), we 
reduce the general case to the case of unique witnesses. For simplicity, we only present a 
randomised version of this algorithm; for derandomisation see Zwick [76] and Alon and 
Naor [4]. 

Let $i \in \{0, 1, \dots, \lceil \log n \rceil - 1\}$. We use the following procedure to attempt to find
witnesses for all $(u, v)$ that have exactly $r$ witnesses for $n/2^{i+1} \le r \le n/2^i$:

1. Let $m = \lceil c \log n \rceil$ for a sufficiently large constant $c$. For $j = 1, 2, \dots, m$, construct a
subset $V_j \subseteq V$ by picking $2^i$ values $v_1, v_2, \dots, v_{2^i}$ from $V$ with replacement, and let
$V_j = \{v_1, v_2, \dots, v_{2^i}\}$.

2. For each $V_j$, use the unique witness detection for the product $S(*, V_j) \star T(V_j, *)$
to find candidate witnesses $Q_{uv}$ for all pairs $(u, v)$, and keep those $Q_{uv}$ that are
witnesses for $S \star T$.

Let $(u, v)$ be a pair with $r$ witnesses for $n/2^{i+1} \le r \le n/2^i$. For each $j = 1, 2, \dots, m$,
the probability that $V_j$ contains exactly one witness for $(u, v)$ is at least $(2e)^{-1}$ (see
Seidel [65]). Thus, the probability that we do not find a witness for $(u, v)$ is bounded by
$(1 - (2e)^{-1})^m = n^{-\Omega(c)}$.

Repeating the above procedure for $i = 0, 1, \dots, \lceil \log n \rceil - 1$ ensures that the probability
of not finding a witness for any fixed $(u, v)$ is at most $n^{-\Omega(c)}$. By the union bound, the
probability that there is any pair of indices $(u, v)$ for which no witness is found is $n^{-\Omega(c)}$,
i.e., with high probability the algorithm succeeds. Moreover, the total number of calls to
the distance product is $O((\log n)^3)$, giving Lemma 21.

4 Lower Bounds 

Lower bounds for matrix multiplication implementations. While proving unconditional
lower bounds for matrix multiplication in the congested clique model seems to be
beyond the reach of current techniques, as discussed in Section 1.4, it can be shown that 
the results given in Theorem 1 are essentially optimal distributed implementations of the 
corresponding centralised algorithms. To be more formal, let C be an arithmetic circuit for 
matrix multiplication; we say that an implementation of C in the congested clique model 
is a mapping of the gates of C to the nodes of the congested clique. This naturally defines 
a congested clique algorithm for matrix multiplication, with the wires in C between gates 
assigned to different nodes defining the communication cost of the algorithm. 

Various authors, considering different parallel models, have shown that in any
implementation of the trivial $\Theta(n^3)$ matrix multiplication on a parallel machine with $P$
processors there is at least one processor that has to send or receive $\Omega(n^2/P^{2/3})$ matrix
entries [2, 41, 69]. As these models can simulate the congested clique, a similar lower
bound holds for congested clique implementations of the trivial $O(n^3)$ matrix multiplication.
In the congested clique, each processor sends and receives $n$ messages per round (up to
logarithmic factors) and $P = n$, yielding a lower bound of $\tilde{\Omega}(n^{1/3})$ rounds.

The trivial $O(n^3)$ matrix multiplication is optimal for circuits using only semiring
addition and multiplication. The task of $n \times n$ matrix multiplication over the min-plus
semiring can be reduced to APSP with a constant blowup [3, pp. 202-205], hence the above
bound applies also to any APSP algorithm that only uses minimum and addition operations.
This means that current techniques for similar problems, like the one used in the fast MST
algorithm of Lotker et al. [51], cannot be extended to solve APSP.

Corollary 22. Any implementation of the trivial $O(n^3)$ matrix multiplication, and any
APSP algorithm which only sums weights and takes the minimum of such sums, require
$\tilde{\Omega}(n^{1/3})$ communication rounds in the congested clique model.

However, known results on centralised APSP and distance product computation give
reasons to suspect that this bound can be broken if we allow subtraction; in particular,
translating the recent result of Williams [73] might allow for running time of order
$n^{1/3}/2^{\Omega(\sqrt{\log n})}$ for APSP in the congested clique.

Concerning fast matrix multiplication algorithms, Ballard et al. [8] have proven lower 
bounds for parallel implementations of Strassen-like algorithms. Their seminal work is 
based on building a DAG representing the linear combinations of the inputs before the 
block multiplications, and the linear combinations of the results of the multiplications 
(“decoding”) as the output matrix. The parallel computation induces an assignment of 
the graph vertices to the processes, and the edges crossing the partition represent the 
communication. Using an expansion argument, Ballard et al. show that in any partition a 
of a graph representing an $\Omega(n^{\sigma})$ algorithm there is a process communicating $\Omega(n^{2 - 2/\sigma})$ values.

See also [9] for a concise account of the technique. 

The lower bound holds for Strassen’s algorithm, and for a family of similar algorithms, 
but not for any matrix multiplication algorithm (see [8, §5.1.1]). A matrix multiplication
algorithm is said to be Strassen-like if it is recursive, its decoding graph discussed above is 
connected, and it computes no scalar multiplication twice. As each process communicates 
at most $O(n)$ values in a round, the implementation of an $\Omega(n^{\sigma})$ Strassen-like algorithm
must take $\Omega(n^{1 - 2/\sigma})$ rounds.

Corollary 23. Any implementation of a Strassen-like matrix multiplication algorithm using
$\Omega(n^{\sigma})$ element multiplications requires $\Omega(n^{1 - 2/\sigma})$ communication rounds in the congested
clique model.

Lower bound for broadcast congested clique. Recall that the broadcast congested 
clique is a version of the congested clique model with the additional constraint that all 
n — 1 messages sent by a node in a round must be identical. 

Frischknecht et al. [31] have shown that approximating the diameter of an unweighted
graph any better than factor 3/2 requires $\tilde{\Omega}(n)$ rounds in the CONGEST model; the same
can be applied to the broadcast congested clique. A variation of the approach was recently
used by Holzer and Pinsker [38] to show that computing any approximation better than
factor 2 to all-pairs shortest paths in weighted graphs takes $\tilde{\Omega}(n)$ rounds as well. As
discussed in Section 3.3, $o(n)$-round matrix multiplication algorithms imply $o(n)$-round
algorithms for exact unweighted and $(1 + o(1))$-approximate weighted APSP. Together, this
immediately implies that matrix multiplication on the broadcast congested clique is hard.

Corollary 24. In the broadcast congested clique model, matrix multiplication algorithms
that are applicable to matrices over the Boolean semiring and APSP algorithms require
$\tilde{\Omega}(n)$ communication rounds.

We remark that the phrase “that are applicable to matrices over the Boolean semiring”
refers to the issue that, in principle, it is possible that matrix multiplication exponents
may be different for different underlying semirings. However, at the very least the lower
bound applies to matrix multiplication over Booleans, integers, and rationals, as well as the
min-plus semiring. We stress that, unlike the lower bounds presented beforehand, this
bound holds without any assumptions on the algorithm itself.

5 Conclusions


In this work, we demonstrate that algebraic methods - especially fast matrix multiplication 
- can be used to design efficient algorithms in the congested clique model, resulting in 
algorithms that outperform the previous combinatorial algorithms; moreover, we have 
certainly not exhausted the known centralised literature of algorithms based on matrix 
multiplication, so similar techniques should also give improvements for other problems. 
It also remains open whether corresponding lower bounds exist; however, it increasingly 
looks like lower bounds for the congested clique would imply lower bounds for centralised 
algorithms, and are thus significantly more difficult to prove than for the CONGEST 
model. 

While the present work focuses on a fully connected communication topology (clique), 
we expect that the same techniques can be applied more generally in the usual CONGEST 
model. For example, fast triangle detection in the CONGEST model is trivial in those 
areas of the network that are sparse. Only dense areas of the network are non-trivial, 
and in those areas we may have enough overall bandwidth for fast matrix multiplication 
algorithms. On the other hand, there are non-trivial lower bounds for distance computation 
problems in the CONGEST model [23], though significant gaps still remain [57].